Bulk-synchronous pseudo-streaming algorithms for many-core accelerators
Abstract
The bulk-synchronous parallel (BSP) model provides a framework for writing parallel programs with predictable performance. In this paper we extend the BSP model to support what we call pseudo-streaming algorithms for accelerators. We also generalize the BSP cost function to these algorithms, making it possible to predict the running time of programs targeting many-core accelerators and to identify possible bottlenecks. We explore several examples of algorithms within this new framework. We extend the BSPlib standard by proposing a small number of new BSP primitives to create and use streams in a portable way. We introduce a software library called Epiphany BSP that implements these ideas for the Parallella development board. Finally, we give experimental results for pseudo-streaming algorithms on the Parallella platform.
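For orientation, the classical BSP cost function that the abstract refers to charges each superstep w + h·g + l, where w is the maximum local work, h is the maximum number of words sent or received by any core, g is the per-word communication cost, and l is the synchronization latency. The sketch below evaluates only this classical form; it does not reproduce the paper's generalized cost for pseudo-streaming algorithms. The names `Superstep` and `bsp_cost` and all parameter values are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Superstep:
    """One BSP superstep (hypothetical helper type for this sketch)."""
    work: float        # w: maximum local computation over all cores, in FLOPs
    h_relation: float  # h: maximum words sent or received by any core


def bsp_cost(supersteps: Iterable[Superstep], g: float, l: float) -> float:
    """Classical BSP cost in FLOP units: sum of w + h*g + l over all supersteps."""
    return sum(s.work + s.h_relation * g + l for s in supersteps)


# Assumed machine parameters; these are not measurements of any real board.
example = [
    Superstep(work=1000, h_relation=50),
    Superstep(work=200, h_relation=10),
]
total = bsp_cost(example, g=5.0, l=100.0)
print(total)
```

Comparing the `work` term with the `h_relation * g` term of a single superstep gives a rough indication of whether that superstep is bound by computation or by communication, which is the kind of bottleneck analysis a cost function makes possible.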
Related resources
Unicorn: A Bulk Synchronous Programming Model, Framework and Runtime for Hybrid CPU-GPU Clusters
Rapid evolution of graphics processing units (GPUs) into general purpose computing devices has made them vital to high performance computing clusters. These computing environments consist of multiple nodes connected by a high speed network such as Infiniband, with each node comprising several multi-core processors and several many-core accelerators. The difficulty of programming hybrid CPU-GPU ...
Practical Parallel External Memory Algorithms via Simulation of Parallel Algorithms
This thesis introduces PEMS2, an improvement to PEMS (Parallel External Memory System). PEMS executes Bulk-Synchronous Parallel (BSP) algorithms in an External Memory (EM) context, enabling computation with very large data sets which exceed the size of main memory. Many parallel algorithms have been designed and implemented for Bulk-Synchronous Parallel models of computation. Such algorithms ge...
Functional Bulk Synchronous Parallel Programming in C++
This paper presents the BSFC++ library for functional bulk synchronous parallel programming in C++. It is based on an extension of the λ-calculus by parallel operations on a parallel data structure named parallel vector, which is given by intention. This guarantees the determinism and the absence of deadlock. Broadcast algorithms are implemented using the core library.
Hybrid algorithms for Job shop Scheduling Problem with Lot streaming and A Parallel Assembly Stage
In this paper, a Job shop scheduling problem with a parallel assembly stage and Lot Streaming (LS) is considered for the first time in both machining and assembly stages. Lot Streaming technique is a process of splitting jobs into smaller sub-jobs such that successive operations can be overlapped. Hence, to solve job shop scheduling problem with a parallel assembly stage and lot streaming, deci...
Dense Matrix Computation on a Heterogenous Architecture: A Block Synchronous Approach
We present a strategy for efficient use of all components of a heterogenous compute node of a typical current generation cluster. Such nodes often comprise multiple sockets with a multicore processor per socket and one or more accelerators, possibly from different generations and/or types. Our strategy differs from schedulers such as Quark or SuperMatrix in that it does not rely on a Directed A...
Journal title:
- CoRR
Volume: abs/1608.07200
Publication date: 2016